tools: passt/pasta head-to-head comparison harness #81
Perf chase summary — voidbox SLIRP optimisation series

Outcome from the head-to-head comparison this PR enables: voidbox throughput now matches pasta-in-netns, and the SLIRP-engine gap to qemu+passt collapsed from a misleading 122× to a real ~1.6× on apples-to-apples CRR. The work was committed locally on this branch but not pushed — these notes capture findings, methodology, and concrete diff sizes for future review. If you want any of it pushed for review, ping me.

Numbers

TCP CRR (apples-to-apples per the spec)

Cumulative: −35% on CRR p50. Gap to qemu+passt: 2.6× → ~1.6×.

TCP throughput (the real win)

Throughput nearly doubled (+96%). Voidbox is now at line rate against pasta-in-netns (12256 Mbps).

Latency primitives unchanged, at parity
CRR via the voidbox-network-bench harness (with nc per iteration):

| Setup | CRR p50 |
|---|---|
| voidbox-network-bench nc per iter, baseline | 10133 µs |
| voidbox-network-bench nc per iter, after perf series | 10140 µs |
Identical, because that path is dominated by guest-side busybox-nc fork+exec, not by SLIRP. The single-process C-binary CRR (the crr-client tool added in this PR) is the apples-to-apples measurement.
What got optimised
5 perf commits on passt-comparison-harness (local only, not pushed):
| Commit | Title | CRR Δ |
|---|---|---|
| 419694a | perf(virtio-net): hot-path cleanups + suppress redundant IRQ pulses | −10% |
| 84ec9d0 | perf(vmm): IRQ delivery via KVM_IRQFD instead of KVM_IRQ_LINE pair | −12% |
| 9e5c6ef | perf(vmm): KVM_IOEVENTFD for virtio-net TX queue notify | −17% |
| 6d7e228 | perf(virtio-net): lock-free RX hand-off via SegQueue (Option B) | −5% |
| a5aa44d | perf(virtio-net): interrupt_status as Arc&lt;AtomicU32&gt; | parity (architectural) |
Plus 2 diagnostic-tool commits:
| Commit | Title |
|---|---|
| d761fad | tools: crr-client + voidbox-side single-process CRR diagnostic |
| 56c2f3a | tools: bench-qemu-slirp.sh — qemu+libslirp / qemu+passt CRR harness |
Highlights of each perf change
- Hot-path cleanups in virtio-net (`419694a`): replaced per-frame `Vec::concat` allocations with a stack `[u8; 8]`, hoisted `avail.idx` reads out of per-frame loops, batched `used.idx` updates per the virtio spec. Suppressed redundant `KVM_IRQ_LINE` pulses on cycles where no new RX work was queued.
- KVM_IRQFD (`84ec9d0`): replaced the assert `level=1` + deassert `level=0` ioctl pair with a single 8-byte write to a registered eventfd. Kernel-side IRQ assertion bypasses the ioctl round-trip.
- KVM_IOEVENTFD (`9e5c6ef`): the guest's TX `QUEUE_NOTIFY` MMIO write now signals an eventfd in-kernel; the vCPU continues running without exiting. The net-poll thread sees the eventfd via the existing `EpollDispatch` and runs `process_tx_queue` on its own schedule. Eliminates one KVM_RUN exit per packet TX'd by the guest.
- Option B lock-free RX hand-off (`6d7e228`): a `pending_rx: Arc<crossbeam_queue::SegQueue<Vec<u8>>>` field on `VirtioNetDevice`. The net-poll thread pushes frames lock-free; the vCPU drains them in its native MMIO context via a new `flush_pending_rx` method. The `Arc<Mutex<VirtioNetDevice>>` device lock is no longer touched by net-poll on the per-packet path.
- `Arc<AtomicU32>` ISR (`a5aa44d`): `interrupt_status` becomes a directly shareable atomic. The net-poll thread caches a clone at startup and reads/writes it without going through the device mutex. No measured perf delta on the single-vCPU benchmark (within noise) but unblocks future work that lets the dispatcher skip the lock for read-only MMIO accesses.
Final profile under sustained bulk throughput
After the series, with `voidbox-network-bench --bulk-mb 200 --iterations 50`, perf-agent on the voidbox process:
| Function | Flat % | Class |
|---|---|---|
| __clone3 | 32.4% | bench harness host-side thread spawn |
| handle_tcp_frame | 27.0% | 97% of which is TcpStream::write → kernel __GI___write |
| kvm_ioctls::VcpuFd::run | 11.7% | KVM_RUN — guest execution |
| process_guest_frame | 7.1% | 96% of which is __GI___write |
| EventFd::write | 4.1% | our IRQFD + IOEVENTFD writes |
| EpollDispatch::wait_with_timeout | 3.0% | epoll_wait |
| vcpu_run_loop | 2.7% | vCPU main loop |
| VirtioNetDevice::process_tx_queue | 0.6% | descriptor parsing — basically free |
Voidbox's own user-mode SLIRP code is sub-1% of CPU during bulk throughput. The handle_tcp_frame 27% flat is dominated by the kernel TCP send syscall, not user-space work. PMU shows IPC 0.673, cache-miss rate 34/1K (high) but on a low instruction volume — the misses live in the kernel/syscall paths, not in voidbox's NAT logic.
Stopping point
Further user-space optimisation has very little headroom on this workload. The next set of changes would need to be architectural, not point fixes:
- `io_uring` for syscall batching (replace per-packet `write()`/`read()`)
- `splice()`/`sendfile()` zero-copy on the guest→host data path
- MSI-X virtio + multi-queue for vCPU scaling
- Skip the host kernel entirely (TAP + passt-style)
Status
- 5 perf commits + 2 diagnostic-tool commits on `passt-comparison-harness` (local).
- Not pushed — flagged as `wip:`-style work pending review of approach.
- Bench harness commits in this PR (`scripts/bench-pasta.py`, `scripts/bench-compare-pasta.py`, `scripts/bench-qemu-slirp.sh`, `tools/crr-client.c`, `tools/qemu-init.sh`, `examples/crr_singleproc_bench.rs`, `docs/passt-comparison.md`) are reproducible — anyone can re-run the comparison.
Headline correction for the PR body: the original "voidbox 122× slower than pasta" claim was misleading — that was overwhelmingly guest-side nc fork+exec, not voidbox's NAT path. The corrected, apples-to-apples claim should be: voidbox SLIRP is ~1.6× slower than qemu+passt on TCP CRR before optimisation, and within ~10–15% (12 Gbps vs 12.2 Gbps) on throughput after the perf series.
Pull request overview
This PR adds a set of performance-harness tools for comparing VoidBox’s SLIRP networking against passt/pasta, and also introduces substantial VMM/virtio-net changes aimed at reducing VM-exit and lock-contention overhead in the networking hot path.
Changes:
- Add a passt/pasta comparison harness (pasta-side bench runner + markdown comparator) plus a qemu SLIRP-vs-SLIRP CRR harness and a static CRR client.
- Add a VoidBox-side “single process CRR” example to isolate per-iteration process-spawn overhead.
- Optimize virtio-net/VMM networking by introducing a lock-free RX handoff, atomic interrupt status, and KVM irqfd/ioeventfd usage to reduce exits and contention.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tools/perf-harness/qemu-init.sh | Guest /init for CRR runs; parses cmdline, configures net, runs client. |
| tools/perf-harness/crr-client.c | Static single-process CRR loop client (connect→req→resp→close). |
| tools/perf-harness/bench-qemu-slirp.sh | Boots a minimal qemu guest and measures CRR vs qemu libslirp/passt backends. |
| tools/perf-harness/bench-pasta.py | Runs throughput/RR/CRR workloads inside a pasta netns and emits JSON Report-like output. |
| tools/perf-harness/bench-compare-pasta.py | Produces side-by-side markdown comparison between voidbox and pasta JSON outputs. |
| src/vmm/mod.rs | net_poll_thread: add irqfd/ioeventfd paths, lock-free RX queueing, and IRQ pulsing changes. |
| src/vmm/cpu.rs | Flush pending RX frames on virtio-net MMIO entry to materialize RX without net-poll holding the device lock. |
| src/devices/virtio_net.rs | Introduce pending_rx SegQueue + atomic interrupt_status; batch used.idx updates; TX/RX hot-path alloc reductions. |
| examples/crr_singleproc_bench.rs | VoidBox-side CRR bench using the same static C client, run inside one guest process. |
| docs/passt-comparison.md | Documentation and usage for the comparison harnesses. |
| Cargo.toml | Add crossbeam-queue dependency for lock-free RX handoff. |
| Cargo.lock | Lockfile updates for crossbeam-queue. |
Comments suppressed due to low confidence (1)
src/devices/virtio_net.rs:776
`reset()` clears `rx_buffer` but does not clear the new lock-free `pending_rx` queue. After a guest device reset (STATUS=0), stale frames already queued by the net-poll thread can still be injected into the RX ring, violating reset semantics. Drain `pending_rx` during reset (pop until empty) or reinitialize it.
```rust
/// Reset device to initial state
fn reset(&mut self) {
    debug!("virtio-net: device reset");
    self.status = 0;
    self.interrupt_status.store(0, Ordering::Relaxed);
    self.driver_features = 0;
    self.tx_avail_idx = 0;
    self.tx_used_idx = 0;
    self.rx_avail_idx = 0;
    self.rx_used_idx = 0;
    self.rx_queue = QueueState {
        num_max: 256,
        ..Default::default()
    };
    self.tx_queue = QueueState {
        num_max: 256,
        ..Default::default()
    };
    self.rx_buffer.clear();
}
```
```python
try:
    conn, _ = srv.accept()
except socket.timeout:
    break
start = time.perf_counter_ns()
with conn:
    # one read + one write keeps it a true CRR round-trip
    try:
        conn.recv(1)
        conn.sendall(b"x")
    except OSError:
        pass
samples.append((time.perf_counter_ns() - start) / 1000.0)
```
```sh
python3 - <<PY &
import os, signal, socket, threading, sys, time
port = int(os.environ.get("HOST_PORT", "$HOST_PORT"))
s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("127.0.0.1", port))
s.listen(64)
sys.stderr.write(f"echo-server: bound 127.0.0.1:{port}\n"); sys.stderr.flush()
def loop():
    while True:
        try: c, _ = s.accept()
        except OSError: return
        try:
            c.recv(1); c.sendall(b"x")
        except OSError: pass
        finally: c.close()
threading.Thread(target=loop, daemon=True).start()
time.sleep(60)
PY
```
```rust
let server_thread = thread::spawn(move || {
    let mut accepted = 0u32;
    listener.set_nonblocking(false).ok();
    let deadline = std::time::Instant::now() + Duration::from_secs(120);
    let (done_tx, _done_rx) = mpsc::channel::<()>();
    while accepted < iterations && std::time::Instant::now() < deadline {
        match listener.accept() {
            Ok((mut conn, _)) => {
                let mut buf = [0u8; 1];
                let _ = std::io::Read::read(&mut conn, &mut buf);
                let _ = std::io::Write::write_all(&mut conn, b"x");
                accepted += 1;
            }
            Err(_) => break,
        }
    }
```
```toml
# (Type::STREAM.nonblocking() needs the "all" feature flag)
socket2 = { version = "0.5", features = ["all"] }

# Lock-free MPMC queue used to hand virtio-net RX frames from the
# net-poll thread to the vCPU thread without taking the
# `Arc<Mutex<VirtioNetDevice>>` device lock on the hot path.
crossbeam-queue = "0.3"
```
```rust
/// Drain frames pushed into [`Self::pending_rx`] by the net-poll
/// thread and write them into the guest's RX descriptors.
///
/// Same descriptor-walking shape as [`Self::try_inject_rx`], but
/// the input frames come from the lock-free SegQueue instead of
/// going through the (locked) network backend. The vCPU thread
/// calls this on every MMIO entry to virtio-net, materialising any
/// frames the net-poll thread queued since the last MMIO exit.
///
/// Returns the number of frames written to the RX ring this call.
pub fn flush_pending_rx<M: GuestMemory + ?Sized>(&mut self, mem: &M) -> Result<usize> {
    let mut frames: Vec<Vec<u8>> = Vec::new();
    while let Some(f) = self.pending_rx.pop() {
        frames.push(f);
    }
    if !frames.is_empty() {
        self.write_frames_to_rx_ring(frames, mem)
    } else {
        Ok(0)
    }
}
```
Two scripts and a doc, deferred deliverable from docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md § "passt head-to-head methodology".

scripts/bench-pasta.py

Drives the same workload shape as voidbox-network-bench (g2h throughput, RR p50/p99, CRR p50) against pasta running in a network namespace. Outputs JSON in the same Report shape so bench-compare-pasta.py can diff the two side by side. pasta is launched with --config-net + --map-host-loopback (default: gateway IP) so connecting to the host gateway from inside the netns reaches the host's 127.0.0.1. Mirrors voidbox's SLIRP convention (10.0.2.2 → 127.0.0.1) closely enough for the apples-to-apples CRR metric.

scripts/bench-compare-pasta.py

Reads two JSONs and emits a markdown side-by-side. Auto-detects which file is which via the `backend` field. Reports the gap as "voidbox N× faster/slower" so the direction is unambiguous.

docs/passt-comparison.md

Caveats + usage. Calls out that throughput numbers are NOT directly comparable (voidbox has VM/MMIO overhead pasta does not). CRR latency is the apples-to-apples metric: dominated by NAT-table operations on both sides.

Tested locally: pasta CRR p50 ≈ 80 µs, voidbox CRR p50 ≈ 10.1 ms on the same host. The gap is dominated by voidbox's poll-thread cadence + virtio-mmio exits, not NAT-table cost — a useful actionable signal for follow-up perf work.
Pair of artefacts used to root-cause the apparent 122x voidbox-vs-pasta
CRR p50 gap reported by scripts/bench-pasta.py.
tools/crr-client.c
Static-linked C binary that performs N TCP CRRs in one process,
no fork or exec per iteration. Output is one line of nanoseconds:
N P50 P99 MEAN. Compile with:
gcc -O2 -static -o /tmp/crr-client tools/crr-client.c
examples/crr_singleproc_bench.rs
Voidbox-side driver. Boots a sandbox with /tmp host-mounted into
the guest, runs the static binary inside the guest, parses the
one-line output. Measures voidbox's NAT-path CRR cost without the
outer bench's per-iteration nc fork+exec.
Result: voidbox-in-VM at 421 us p50 vs pasta-in-netns at 107 us p50
is dominated (~300 us of the ~314 us gap) by VM transit (virtio-mmio
exits, KVM IRQ injection, vsock RPC), not by SLIRP-engine cost.
A genuinely apples-to-apples SLIRP-vs-SLIRP comparison (passt+qemu
vs voidbox+voidbox-VM) is the natural follow-up; this commit captures
the tooling so that follow-up can stand on a reproducible baseline.
Boots a minimal qemu guest carrying tools/crr-client and runs N TCP CRRs against a host TCP server. Two backends:

--backend libslirp    qemu's built-in -netdev user (libslirp)
--backend passt       qemu -netdev stream + passt(1) over UNIX socket

Same workload + iteration count as scripts/bench-pasta.py and examples/crr_singleproc_bench.rs, so the five datapoints (host-direct, pasta-in-netns, qemu+libslirp, qemu+passt, voidbox+voidbox-SLIRP) are directly comparable on the same machine.

The script auto-builds the initramfs from tools/qemu-init.sh + busybox + tools/crr-client, including virtio_net + failover modules from the host kernel so a stock distro kernel can probe the qemu virtio-net-pci device. Voidbox's slim kernel has them built-in and the insmod calls fail harmlessly.

Result on the dev machine:

host-direct              63 us p50
pasta (netns, no VM)    107 us p50
qemu+libslirp (in VM)   181 us p50
qemu+passt (in VM)      163 us p50
voidbox+voidbox-SLIRP   421 us p50

Voidbox is ~2.2x slower than the mature C SLIRPs in the same VM-attached configuration -- the genuine engine gap, independent of fork artefact (10x) and VM transit (which both sides pay).
Four small wins on the per-packet path between the SlirpBackend's
inject queue and the guest, identified by the SLIRP-vs-SLIRP
comparison (voidbox 421 us p50 vs qemu+passt 163 us p50 on the
single-process TCP CRR benchmark).
src/devices/virtio_net.rs::try_inject_rx
- Read avail.idx ONCE per call instead of per frame. The driver
only bumps it when adding new buffers; per-frame re-reads are
redundant guest-memory accesses.
- Replace 'let used_elem = [...].concat()' with a stack [u8; 8].
The previous code allocated a Vec<u8> per injected frame in the
hot path; the new code costs four byte copies and zero allocs.
- Write used.idx ONCE at the end of the batch rather than after
every frame. The virtio spec only requires a single update per
publish; per-frame writes were redundant guest-memory accesses.
- Return frames_injected (usize) so callers can pulse the IRQ
line conditionally on actual new RX work.
src/devices/virtio_net.rs::process_tx_queue
- Replace per-frame Vec::concat with stack [u8; 8] (same fix as
the RX path).
- Read each TX descriptor segment directly into the packet buffer
via packet.resize() + mem.read(&mut packet[off..]) instead of
allocating an intermediate Vec<u8> and extend_from_slice'ing.
Saves one allocation and one full memcpy per descriptor segment.
- Reuse a single Vec<u8> packet buffer with capacity 1600 across
all frames in the call instead of allocating fresh per frame.
- Batch used.idx update at end of the batch (same as RX).
src/vmm/mod.rs::net_poll_thread
- Track previous-cycle pending state. Pulse KVM_IRQ_LINE only
when (a) we actually injected new RX frames this cycle OR (b)
interrupt_status went from clear -> pending across cycles.
Previously the loop pulsed twice (assert level=1, then deassert
level=0) on every cycle while interrupt_status was non-zero,
even when the guest hadn't acked the previous pulse and no new
work had arrived. Skipping the pulse pair when there's nothing
new saves two ioctl(KVM_IRQ_LINE) calls per redundant cycle
(~5-10 us each on the CRR hot path).
Effect on the single-process CRR p50 (mean of 5 runs of 30
iterations each, voidbox+voidbox-SLIRP):
before: 421 us p50 mean
after: 380 us p50 mean (~10% improvement)
The IRQ pulse change is the dominant contributor; the RX/TX heap
allocation removals are correct cleanup but contribute below
sample variance. Voidbox's gap to qemu+passt (163 us) shrinks
from 2.6x to 2.3x; remaining gap candidates are MMIO exit cost,
KVM_IRQ_LINE vs irqfd, and SlirpBackend lock contention.
The voidbox net-poll thread was raising IRQ 10 with two ioctl(KVM_IRQ_LINE) calls per pulse: assert level=1, then deassert level=0. Each ioctl is a syscall (~few us each on KVM); on the TCP CRR hot path with multiple IRQ deliveries per connection, the ioctl pair became a measurable share of per-iteration cost.

Replace with KVM_IRQFD: one eventfd registered with the in-kernel irqchip via vm_fd().register_irqfd(&eventfd, 10) at thread startup. Pulsing the IRQ is now a single 8-byte write to the eventfd; the kernel asserts the IRQ line directly without a userspace round-trip through ioctl().

The legacy KVM_IRQ_LINE path is kept as a fallback when irqfd registration fails (kernel without irqfd support, irqchip routing not initialised). In normal operation the eventfd succeeds at startup and the legacy ioctls never run.

Effect on the single-process CRR p50 (mean over 5 runs of 30 iterations, voidbox+voidbox-SLIRP):

before this commit: ~380 us p50
after this commit:  ~335 us p50 (~12% reduction)

Cumulative with the previous virtio-net hot-path cleanups:

baseline:        421 us p50
after all fixes: ~335 us p50 (~20% cumulative reduction)

Voidbox's gap to qemu+passt (163 us) shrinks from 2.6x to 2.0x.
Without ioeventfd, every guest TX (write to QUEUE_NOTIFY MMIO with value=1) forces a KVM_RUN exit: the vCPU thread dispatches into virtio-net's write_mmio handler, calls process_tx_queue, then re-enters KVM_RUN. On the TCP CRR hot path with multiple TX per connection that's a few microseconds of pure VM-exit overhead per packet on top of the actual network work.

Register the eventfd at MMIO addr 0xd000_0050 with datamatch=1 (TX queue notify only). Now KVM consumes the matching MMIO write in-kernel and signals the eventfd; the vCPU continues running uninterrupted. The net-poll thread sees the eventfd alongside flow events on the existing EpollDispatch (under a token in a tag space that doesn't collide with PROTO_TAG_*), drains it, and calls process_tx_queue on its own schedule.

Notifies for queue 0 (RX, value=0) still take the slow path through the MMIO write handler — they're rare (only when the guest adds new RX buffers) so the optimisation isn't needed there. Falls back to the synchronous MMIO-exit path if eventfd creation or KVM_IOEVENTFD registration fails.

Effect on the single-process CRR p50 (mean over 5 runs of 30 iterations, voidbox+voidbox-SLIRP):

before this commit: ~335 us p50
after this commit:  ~278 us p50 (~17% reduction)

Cumulative across the recent perf series:

baseline:              421 us p50
+ virtio-net cleanups: ~380 us p50
+ KVM_IRQFD:           ~335 us p50
+ KVM_IOEVENTFD:       ~278 us p50 (~34% cumulative)

Voidbox's gap to qemu+passt (163 us) shrinks from 2.6x to 1.7x.
Restructures the host->guest RX path to eliminate the
Arc<Mutex<VirtioNetDevice>> contention between the net-poll thread
and the vCPU thread. Inspired by the user-suggested Option B:
"net-poll -> rx_queue[vCPU] -> that vCPU consumes".
Before:
net-poll thread:
let mut g = net_dev.lock(); // takes device mutex
g.try_inject_rx(mem); // descriptor walk + writes
drop(g);
pulse_irq();
vCPU thread on MMIO exit:
let g = net_dev.lock(); // waits for net-poll
g.mmio_read(...);
After:
net-poll thread:
drain backend frames into a Vec; // backend mutex only
push each frame to pending_rx; // lock-free SegQueue
pulse_irq(); // never touches device mutex
vCPU thread on MMIO exit:
let mut g = net_dev.lock(); // uncontended now
g.flush_pending_rx(mem); // descriptor writes here
g.mmio_read/mmio_write(...);
Net-poll's hot path no longer holds the VirtioNetDevice mutex at
all -- it only acquires the SLIRP backend Arc independently. vCPU's
MMIO exits do the descriptor work in-context, paying for it once per
exit but never waiting on a held lock.
Implementation:
src/devices/virtio_net.rs
- new field pending_rx: Arc<crossbeam_queue::SegQueue<Vec<u8>>>
- pending_rx() accessor returns a clone of the Arc
- slirp_arc() exposes the backend Arc for direct net-poll access
- new method flush_pending_rx(&mut self, mem) drains the SegQueue
and writes RX descriptors using the same loop as try_inject_rx
- try_inject_rx is now a thin wrapper that calls a new shared
helper write_frames_to_rx_ring; same behaviour, structured
so flush_pending_rx can share the descriptor-writing logic.
src/vmm/mod.rs::net_poll_thread
- Cache pending_rx + slirp Arcs once at thread startup; never
touch the VirtioNetDevice mutex on the per-cycle path.
- Drain backend frames into a reusable Vec, wrap each with a
virtio-net header, push to the SegQueue, then pulse the IRQ.
src/vmm/cpu.rs (MMIO dispatch)
- Call guard.flush_pending_rx(guest_memory) at the top of the
virtio-net MMIO read AND write handlers. Materialises any
frames the net-poll thread queued since the last MMIO exit.
Adds: crossbeam-queue = "0.3".
Effect on the single-process CRR p50 (mean over 5 runs of 30
iterations, voidbox+voidbox-SLIRP):
before this commit: ~278 us p50
after this commit: ~265 us p50 (~5% reduction)
Modest improvement on the single-vCPU benchmark we have available --
the win is mostly architectural (eliminates a contention point that
will become more meaningful with multi-vCPU guests, higher pps, and
parallel TX/RX paths).
Cumulative across the whole perf series:
baseline: 421 us p50
+ virtio-net cleanups: ~380 us p50
+ KVM_IRQFD: ~335 us p50
+ KVM_IOEVENTFD: ~278 us p50
+ Option B SegQueue: ~265 us p50 (~37% cumulative)
Voidbox's gap to qemu+passt (163 us) is now ~1.6x.
Wraps the device's interrupt_status register in Arc<AtomicU32> so the
net-poll thread can read and update it without taking the device
mutex. Three concrete benefits:
1. has_pending_interrupt() is now a single relaxed atomic load on
&self -- safe to call from any thread, no lock, no contention.
2. The net-poll thread caches a clone of the Arc at startup and
uses it directly for its idle-cycle 'do I need to pulse the IRQ?'
check, removing one mutex acquisition per cycle.
3. interrupt_status |= 1 (set by RX inject) and interrupt_status &=
!value (cleared by guest's INTERRUPT_ACK MMIO write) are now
fetch_or / fetch_and atomic operations -- no read-modify-write
race between the vCPU thread and the net-poll thread.
The vCPU thread's MMIO read of INTERRUPT_STATUS still goes through
the device mutex via the existing dispatcher, but the underlying
operation is now a pure atomic load -- a follow-up that lets the
dispatcher skip the lock for read-only MMIO accesses gets a cleaner
path because the field no longer needs synchronisation through the
mutex.
Single-vCPU CRR is within sample noise of the previous measurement
(~265 us p50 -> ~289 us across 5 runs of 30 iterations); the win is
mostly architectural rather than measurable on this workload. Real
benefit shows up with multi-vCPU guests, higher pps, or workloads
where the net-poll and vCPU threads contend more aggressively.
Collects the SLIRP-vs-SLIRP / vs-pasta diagnostic tooling under one
directory. Five files relocate, no behaviour change:
scripts/bench-pasta.py -> tools/perf-harness/bench-pasta.py
scripts/bench-compare-pasta.py -> tools/perf-harness/bench-compare-pasta.py
scripts/bench-qemu-slirp.sh -> tools/perf-harness/bench-qemu-slirp.sh
tools/crr-client.c -> tools/perf-harness/crr-client.c
tools/qemu-init.sh -> tools/perf-harness/qemu-init.sh
Updates path references in:
- bench-qemu-slirp.sh (uses $SCRIPT_DIR for qemu-init.sh location;
updated busybox extraction to climb two dirs up to repo root)
- examples/crr_singleproc_bench.rs (doc + error message paths)
- docs/passt-comparison.md (usage examples + extended example block
that now also covers bench-qemu-slirp.sh and crr_singleproc_bench)
Smoke-tested after the move:
- tools/perf-harness/bench-pasta.py --iterations 1 ... passes
- tools/perf-harness/bench-qemu-slirp.sh --backend libslirp passes
Eight follow-up fixes from PR #81 review:

src/vmm/mod.rs:

Extract `setup_tx_notify_ioeventfd` helper and gate the entire IOEVENTFD path on `epoll_arc.is_some()`. Fixes the original safety concern: the previous code registered KVM_IOEVENTFD even when no epoll dispatcher was available, which would have left guest TX notifies trapped in-kernel with no userspace drain — a silent hang. The helper rolls back the epoll registration if KVM_IOEVENTFD registration fails, so the two halves succeed or fail together.

examples/crr_singleproc_bench.rs:

Switch the host-side accept thread to non-blocking accept with a deadline check so the example never hangs forever if the guest fails to connect. The initial Copilot suggestion of a 2 ms sleep inflated each guest CRR sample by ~1.8 ms (sleep latency directly added to per-iter accept-pickup time). Reduced to 50 µs to keep the sample noise below the metric resolution.

tools/perf-harness/bench-pasta.py:

- `detect_host_gateway` now parses the route line by the `via` keyword instead of indexing parts[2], so non-standard route formats don't silently pick up the wrong field.
- CRR timer started before `srv.accept()` to match the voidbox-network-bench `crr_echo_server` semantics.

tools/perf-harness/bench-qemu-slirp.sh:

- Replace `time.sleep(60)` with `threading.Event().wait()` so the host echo server stays alive for the entire qemu run instead of timing out at 60 s.
- Add fail-fast bind error handling so port collisions surface immediately instead of producing a confusing "no result" later.

tools/perf-harness/qemu-init.sh:

Derive the netmask from the CIDR prefix instead of hardcoding 255.255.255.0, so non-/24 networks work.

tools/perf-harness/bench-compare-pasta.py:

Remove unused `sign` variable.

docs/passt-comparison.md:

Update path reference from `scripts/` to `tools/perf-harness/`.

Verified: voidbox single-process CRR p50 stays at ~280-310 µs (within noise of pre-fix baseline) and `cargo test --test network_baseline` passes 24/24.
9394dd6 to 3c5da08
Replace `std::mem::take(&mut *queue)` with an in-place
`extend_from_slice` + `clear()` against a scratch Vec owned by
`SlirpBackend`. The previous pattern moved the queue's allocation
out and left a fresh `Vec::new()` (cap=0) behind, forcing the next
`push_ready_events` to grow `extend_from_slice` from cap=0 every
cycle.
Heaptrack on the single-process CRR bench (30 iters) measured
this single callsite as ~half of all allocations during the run:
before: push_ready_events 4843 allocs (49% of total)
drain_to_guest 4776 allocs (48% of total)
total 12618 allocs
after: push_ready_events gone from top callers
drain_to_guest 3957 allocs (still hot, downstream)
total 6885 allocs (-45%)
p50 CRR latency is unchanged (~270 µs); the wall-clock floor is
elsewhere on this workload. The win is reduced allocator churn
(GC pressure, jitter on bulk paths, fewer slow-path mallocs under
sustained load) — visible in the throughput bench rather than CRR
microbench.
The `pending_events` Mutex<Vec> is also pre-sized to
`EVENTS_PRESIZE = 128` at construction so the very first push
doesn't reallocate.
The SLIRP backend's per-second new-connection rate limit
(`max_connections_per_second`, default 50/s) and concurrent-
connection ceiling (`max_concurrent_connections`, default 64) are
production anti-DoS defaults baked into `LocalSandbox`. They are
hostile to microbenches that intentionally open hundreds of
connections in a tight loop — at 51 connects/s the limiter starts
returning RST to the guest, which crr-client sees as
`ECONNREFUSED` on its very next connect and exits with rc=3.
Reproduced as the "100-iter failure" in `crr_singleproc_bench`:
30 iters worked, 60 iters did not; the threshold was the 50/s
limit, not anything in the network stack itself.
Surface the two ceilings on `Sandbox::local()` as builder methods:
.network_max_connections_per_second(u32::MAX)
.network_max_concurrent_connections(usize::MAX)
`None` keeps the production defaults, so this is purely additive.
The bench now uses both. 500-iter run reproduces clean
(p50 268 µs, p99 1.6 ms, host accepts 500/500).
Both `flush_pending_rx` and `try_inject_rx` previously built a
fresh `Vec<Vec<u8>>` on every MMIO exit and handed it to
`write_frames_to_rx_ring`, which consumed it by value. The
pattern dropped the outer-Vec allocation and forced the next call
to grow it from cap=0 — heaptrack on the CRR microbench measured
the flush_pending_rx site at 173 calls / 108 MB peak, the largest
remaining alloc consumer after the SLIRP `ready_scratch` fix.
`write_frames_to_rx_ring` now takes `&mut Vec<Vec<u8>>` and drains
in place via `drain(..)` / `append`, so callers reuse a long-lived
scratch buffer:
- `flush_pending_rx` uses a new `flush_scratch` field on
`VirtioNetDevice`, populated from `pending_rx` (SegQueue) and
cleared at end.
- `try_inject_rx` reuses the existing `rx_scratch` field that
was already paired with `get_rx_frames`; the trailing
`mem::take` in `get_rx_frames` is now followed by a
`clear()` + restore at the end of `try_inject_rx`, so the
capacity persists across the round-trip.
Heaptrack on 100-iter CRR:
before this commit: 6885 allocs / 30 iters = 229/iter
after this commit: 18926 allocs / 100 iters = 189/iter
Aggregate from the original baseline:
baseline (before all fixes): ~421 allocs/iter
this commit: ~189 allocs/iter (-55%)
p50 latency unchanged at ~275 µs as expected — alloc reduction
shows up in throughput and tail-latency stability, not the CRR
floor.
`relay_tcp_nat_data` builds a temporary `Vec<Vec<u8>>` per call because the relay can't push directly to `inject_to_guest` while iterating `flow_table` (both are `&mut self`). The previous pattern allocated a fresh `Vec::new()` every cycle, which heaptrack flagged as the biggest remaining contributor inside `drain_to_guest`'s call tree after the prior `ready_scratch` and `flush_scratch` fixes.

Move the buffer onto `SlirpBackend` as `relay_frames_scratch` and use the standard `mem::take` → process → restore pattern so the buffer's capacity persists across `drain_to_guest` calls. The two trailing `inject_to_guest.append(&mut frames_to_inject)` sites already preserve capacity (Vec::append leaves the source empty but with its allocation intact); only the entry-point `Vec::new()` was discarding work.

Cumulative impact on the 100-iter CRR microbench:

baseline (before any of these fixes):  ~421 allocs/iter
after ready_scratch + flush_scratch:   ~189 allocs/iter
after relay_frames_scratch (this PR):   ~93 allocs/iter (-78%)

p50 latency continues at ~275 µs; the floor is dominated by KVM-exit / wakeup costs, not allocator churn. The win shows up under sustained load where reduced allocator pressure improves tail-latency stability and per-frame jitter.
Three of the relay functions called from `drain_to_guest`
(`relay_tcp_nat_data`, `relay_icmp_echo`, `relay_udp_flows`) each built a
per-call `Vec<FlowKey>` to side-step the `&mut self` / `flow_table` borrow
conflict. The Vecs were allocated, populated, drained, and dropped on every
cycle. The UDP relay built two — one for the stale-sweep, one for the
readiness loop.

Add a single `flow_keys_scratch: Vec<FlowKey>` field on `SlirpBackend` and
rotate it through all four sites with the `mem::take` → process → restore
pattern (the relays run sequentially inside `drain_to_guest`, so one buffer
suffices). Each iteration uses `Vec::drain(..)` instead of for-by-value so
capacity is preserved across the consume.

Heaptrack on the 100-iter CRR microbench:
  before this commit: 9296 allocs (~93/iter)
  after this commit:  4103 allocs (~41/iter)
  temporary allocs:   5546 → 574 (-90%)

Cumulative from the original baseline (start of this round):
  ~421 allocs/iter → ~41 allocs/iter (-90%)

p50 latency unchanged at ~275 µs as predicted; the wall-clock floor is
dominated by KVM exits / vCPU wakeups. The gain shows up as reduced
allocator pressure on bulk paths and fewer slow-path mallocs under
sustained load.

Top remaining alloc callsites are now per-frame `Vec<u8>` from
`build_tcp_packet_static` (one allocation per TCP frame) and TX queue frame
parsing — both intrinsic to the protocol shape; further reduction needs a
pool/arena, not a scratch hoist.
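A minimal sketch of the rotation, with illustrative types (the real `FlowKey` and relay bodies differ). The key point is `drain(..)`: consuming the Vec by value would drop its allocation, while draining leaves the capacity in place for the next relay pass:

```rust
use std::collections::HashMap;
use std::mem;

// Illustrative flow key; the real type carries the full 5-tuple.
type FlowKey = (u32, u16);

struct Backend {
    flow_table: HashMap<FlowKey, u64>,
    flow_keys_scratch: Vec<FlowKey>,
}

impl Backend {
    fn relay_pass(&mut self) {
        // Take the scratch so iterating flow_table doesn't conflict with
        // the &mut self calls inside the loop.
        let mut keys = mem::take(&mut self.flow_keys_scratch);
        keys.extend(self.flow_table.keys().copied());
        // drain(..) empties `keys` but keeps its allocation.
        for key in keys.drain(..) {
            self.touch(key);
        }
        self.flow_keys_scratch = keys; // capacity persists to the next pass
    }

    fn touch(&mut self, key: FlowKey) {
        if let Some(v) = self.flow_table.get_mut(&key) {
            *v += 1;
        }
    }
}
```

Because the relays run one after another, a single scratch field can serve all four sites without aliasing.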
Same fix as `crr_singleproc_bench`: the bench's CRR phase opens 30 connections in <1s, which trips the production SLIRP rate limiter (50 conn/s) and surfaces as a 2 s "crr echo channel receive error" instead of a real number. Use the new `Sandbox::local()` rate-limit knobs to lift both ceilings (max_connections_per_second + max_concurrent_connections) explicitly. Production sandboxes are unaffected — the lift is opt-in.
Plan doc for the next perf round. After #81's user-space alloc reductions
exhausted (-90% allocs/iter, p50 unchanged), the remaining floor is
kernel↔userspace transitions, MMIO exits, and single-queue serialization.

Three experiments in scope, ranked by risk × payoff:

1. io_uring for SLIRP host-socket I/O — start here
2. splice() / sendfile() zero-copy on bulk paths
3. MSI-X virtio + multi-queue for vCPU scaling

Non-goal: TAP + passt-style host bypass. Routing through an external passt
would close the latency gap to passt but moves the DNS interception,
port-forwarding, deny-list, and rate-limiting feature surface out of
voidbox — and loses the in-process observability we currently get from
instrumenting SLIRP directly. Full SLIRP-path observability is a hard
requirement.

Each experiment lands as its own commit, gated behind a Cargo feature so
the #81 baseline can A/B against it without a revert. Measurements use the
harness shipped in #81.
First commit on the architectural-experiments branch (#83). Adds a
`UringBatch` wrapper around `io_uring::IoUring` with the submit / drain
shape the SLIRP relay will use to batch host-socket recv / send into single
`io_uring_enter` round-trips.

Key shape:

- One `UringBatch` is single-owner: the SLIRP `net_poll_thread` constructs
  and drives one. No locking, no cross-thread sharing.
- SQEs are tagged with `(UringOp, correlation_id)` packed into `user_data`
  so the completion drain routes a CQE back to its originating flow without
  a side table. Low 32 bits = correlation id, top 32 bits = op tag.
- `submit_recv` / `submit_send` are `unsafe` because the kernel references
  the user buffer asynchronously; the caller's safety contract requires
  `buf` to outlive the matching CQE.
- The existing `EpollDispatch` keeps owning the readiness signal — io_uring
  replaces only the data-plane syscalls, not the wake-up. The two layers
  stay separable so the feature can be toggled off without touching the
  relay state machine.

Behavior unchanged: nothing wires this in yet. Cargo feature `io-uring`
(off by default) gates both the new module and the `io-uring = "0.7"`
dependency. Module is `#![allow(dead_code)]` for now; the next commit on
this branch wires the relay TCP recv / send paths through it and removes
the allow.

Tests:

- 4 unit tests in `src/network/uring.rs` cover the user-data round trip +
  a real `submit_send` -> `submit_recv` cycle across a `socketpair`
  (skipped on kernels without io_uring).
- `cargo test --features io-uring --lib`: 381 passed.
- `cargo test --test network_baseline` (default features): 24/24.
- `cargo clippy --all-targets [-- -D warnings]` clean both with and without
  the feature.

Methodology per `docs/perf-architectural-experiments.md`: each experiment
lands as one feature-gated commit so the #81 baseline can A/B against it
without a revert. This is the infrastructure commit; the next one wires +
measures.
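The `user_data` tagging can be sketched as below. The layout follows the commit text (low 32 bits = correlation id, top 32 bits = op tag); the `UringOp` variants and function names are assumptions for illustration:

```rust
// Hypothetical op tags; the real UringOp enum may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum UringOp {
    Recv = 0,
    Send = 1,
}

// Pack (op tag, correlation id) into the 64-bit SQE user_data field.
fn pack_user_data(op: UringOp, correlation_id: u32) -> u64 {
    ((op as u64) << 32) | correlation_id as u64
}

// On CQE drain: recover (op tag, correlation id) without a side table.
fn unpack_user_data(user_data: u64) -> (u32, u32) {
    ((user_data >> 32) as u32, user_data as u32)
}
```

Since the id travels inside the kernel-owned `user_data`, completions route back to their flow with no allocation and no lookup.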
Companion to `crr_singleproc_bench`: drives M concurrent crr-client
processes in the same guest so the SLIRP relay sees N>1 ready flows per
`net_poll_thread` cycle. The single-flow microbench can't see io_uring
batching or multi-queue wins because there's nothing to batch or
parallelize with one ready flow at a time; this bench is the workload the
architectural experiments on this branch (#83) need.

Each per-flow `crr-client` writes its summary line to its own
`/tmp/crr_results/$i.txt`; the trailing shell loop concatenates all M lines
for the host to parse. Aggregation reports median-of-p50s, max p99,
mean-of-means, and aggregate qps. Note: busybox-static lacks `seq`, so the
flow-id list is materialized on the host and inlined into the shell
command.

## Baseline (this branch's tip = #81 + io_uring scaffold)

Single net_poll_thread, no architectural changes wired:

| M | Median p50 | Max p99 | Aggregate qps |
|---|-----------:|--------:|--------------:|
| 1 | 275 µs | ~2 ms | ~3636 |
| 2 | 473 µs | 12.9 ms | 2173 |
| 4 | 732 µs | 13.2 ms | 2370 |
| 8 | 2043 µs | 14.5 ms | 2242 |

Reading:

- Aggregate qps saturates at ~2200-2400 regardless of M — the single
  net_poll_thread is the bottleneck.
- Per-flow p50 grows ~linearly with M (at M=8 each flow takes 7.4× the M=1
  p50).
- p99 jumps to 12-14 ms already at M=2; tail latency is dominated by
  per-flow head-of-line blocking through the single epoll loop.

This is exactly the workload io_uring batching, splice, and multi-queue
should move. The io_uring wiring lands in the next commit on this branch
with measurements against this table.
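The host-side aggregation described above can be sketched as follows. The `FlowSummary` fields are illustrative; the real harness parses them from the concatenated `/tmp/crr_results/$i.txt` lines:

```rust
// Hypothetical per-flow summary parsed from one result line.
#[derive(Clone, Copy)]
struct FlowSummary {
    p50_us: f64,
    p99_us: f64,
    mean_us: f64,
    qps: f64,
}

// Returns (median-of-p50s, max p99, mean-of-means, aggregate qps).
fn aggregate(flows: &[FlowSummary]) -> (f64, f64, f64, f64) {
    let mut p50s: Vec<f64> = flows.iter().map(|f| f.p50_us).collect();
    p50s.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median_p50 = p50s[p50s.len() / 2]; // upper median for even M
    let max_p99 = flows.iter().map(|f| f.p99_us).fold(f64::MIN, f64::max);
    let mean_of_means =
        flows.iter().map(|f| f.mean_us).sum::<f64>() / flows.len() as f64;
    let agg_qps = flows.iter().map(|f| f.qps).sum::<f64>();
    (median_p50, max_p99, mean_of_means, agg_qps)
}
```

Median-of-p50s rather than mean keeps one stalled flow from hiding in the headline number, while max p99 surfaces the worst head-of-line victim.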
Summary
Originally a passt/pasta comparison harness — has since grown into a full SLIRP perf-improvement series driven by the harness measurements and a heaptrack-driven follow-up round.
Final results
Measured on the same Fedora 43 / KVM host: `voidbox-network-bench
--iterations 3` plus the single-process CRR microbench at
`--iterations 100` (the `voidbox-network-bench` CRR number remains
busybox-nc-fork-bound).

Reading:
- `tcp_crr_latency_us_p50` from `voidbox-network-bench` is dominated by
  `busybox-nc` fork+exec per iteration, not SLIRP. The single-process CRR
  bench (~275 µs) reflects the actual NAT path.

What's new
Harness (the original PR scope)
- `tools/perf-harness/bench-pasta.py` — drives the same workload shape as
  `voidbox-network-bench` (`tcp_throughput_g2h_mbps`,
  `tcp_rr_latency_us_p50/p99`, `tcp_crr_latency_us_p50`) against pasta
  running in a network namespace. Outputs JSON in the same `Report` shape.
- `tools/perf-harness/bench-compare-pasta.py` — reads two JSONs and emits a
  markdown side-by-side. Auto-detects which file is voidbox vs pasta via
  the `backend` field.
- `tools/perf-harness/bench-qemu-slirp.sh` + `qemu-init.sh` +
  `crr-client.c` — qemu side of a proper SLIRP-vs-SLIRP head-to-head
  (qemu+libslirp / qemu+passt vs voidbox+SLIRP).
- `examples/crr_singleproc_bench.rs` — voidbox-side single-process CRR
  diagnostic that pairs with the C `crr-client`. Isolates the NAT path from
  the original bench's per-iteration `nc` fork+exec overhead.
- `docs/passt-comparison.md` — usage + methodology caveats.

Perf round 1 — wall-clock CRR optimizations
Five commits driven by the harness exposing a 122× CRR gap that turned out
to be `net_poll_thread`'s 5 ms active cadence:

- virtio-net hot-path cleanups + suppression of redundant IRQ pulses
- IRQ delivery via `KVM_IRQFD` instead of a `KVM_IRQ_LINE` pair
- `KVM_IOEVENTFD` for the virtio-net TX queue notify
- lock-free RX hand-off via `SegQueue` (removes `Arc<Mutex<VirtioNetDevice>>`
  contention against the vCPU)
- `interrupt_status` as `Arc<AtomicU32>` (allows concurrent ack between
  vCPU and net-poll thread)

Perf round 2 — heaptrack-driven allocation hoisting
heaptrack on the same workload found that 97% of allocations during the
bench were per-cycle `Vec` growth in the SLIRP / virtio-net hot path —
primarily `mem::take(&mut *queue)`-style discards of buffer capacity. Four
surgical commits hoist scratch Vecs to long-lived fields:

- `ready_scratch` (events Vec) — replaces `mem::take` on `pending_events`
  with `clear()` + `extend_from_slice`.
- `flush_scratch` (RX-inject Vec<Vec>) — `write_frames_to_rx_ring` now
  takes `&mut Vec` and drains in place.
- `relay_frames_scratch` — `relay_tcp_nat_data`'s deferred frame Vec.
- `flow_keys_scratch` — single shared `Vec<FlowKey>` rotated across the
  TCP/ICMP/UDP relays via the `mem::take` pattern.

Per-step allocation reduction on the 100-iter CRR bench:

| Step | allocs/iter |
|---|---:|
| baseline | ~421 |
| `ready_scratch` | ~229 |
| `flush_scratch` | ~189 |
| `relay_frames_scratch` | ~93 |
| `flow_keys_scratch` | ~41 |

p50 latency unchanged at ~275 µs as predicted; the wall-clock floor is
dominated by KVM exits / vCPU wakeups, not allocator churn.
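The `ready_scratch` change trades `mem::take` (which hands the allocation away each cycle) for copy-into-a-long-lived-buffer. A minimal sketch with illustrative types:

```rust
// Illustrative stand-in for the net-poll event queue; the real
// pending_events and ready_scratch hold event structs, not u64s.
struct Poller {
    pending_events: Vec<u64>,
    ready_scratch: Vec<u64>,
}

impl Poller {
    // Returns the ready events without giving up either Vec's capacity.
    fn drain_ready(&mut self) -> &[u64] {
        self.ready_scratch.clear(); // keeps capacity
        self.ready_scratch.extend_from_slice(&self.pending_events);
        self.pending_events.clear(); // also keeps capacity
        &self.ready_scratch
    }
}
```

After warm-up, both Vecs sit at their high-water capacity and the per-cycle allocation count drops to zero on this path.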
Bench infrastructure fixes
- `Sandbox::local()` builder methods (`network_max_connections_per_second`,
  `network_max_concurrent_connections`). Production defaults (50 conn/s,
  64 concurrent) hard-rejected the bench's >50 connect/s pattern; both
  `crr_singleproc_bench` and `voidbox-network-bench` now lift both ceilings
  explicitly. Surfaced as a 100-iter "Connection refused" failure during
  the heaptrack work.
- `crr_singleproc_bench` accept-loop: 50 µs non-blocking poll instead of a
  2 ms sleep (the latter inflated each guest CRR sample by ~1.8 ms, an 8×
  regression in earlier review-fix versions).
- `bench-qemu-slirp.sh`: server stays alive for the full qemu run (was
  60 s); fail-fast on bind error.
- `bench-pasta.py`: gateway parsed by the `via` keyword; CRR timer starts
  before `accept()` to match `voidbox-network-bench` semantics.
- `qemu-init.sh`: netmask derived from the CIDR prefix (was hardcoded /24).
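The accept-loop fix can be sketched as below: a non-blocking `accept()` polled every 50 µs, so a pending connection is picked up within one poll interval instead of sitting behind a 2 ms sleep. Function name and shape are illustrative, not the bench's actual code:

```rust
use std::io::ErrorKind;
use std::net::TcpListener;
use std::time::Duration;

// Illustrative sketch: accept one connection with a 50 µs poll cadence.
fn accept_one(listener: &TcpListener) -> std::io::Result<std::net::TcpStream> {
    listener.set_nonblocking(true)?;
    loop {
        match listener.accept() {
            Ok((stream, _addr)) => return Ok(stream),
            // No pending connection yet: back off 50 µs, not 2 ms.
            Err(e) if e.kind() == ErrorKind::WouldBlock => {
                std::thread::sleep(Duration::from_micros(50));
            }
            Err(e) => return Err(e),
        }
    }
}
```

With a 2 ms sleep the expected wait per accept is ~1 ms and the worst case 2 ms, which is why each guest CRR sample was inflated by roughly that amount.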
pasta is the same forwarding/NAT engine as passt minus the qemu glue — it
runs in a network namespace, reachable as `pasta -- COMMAND`. The
lower-friction first cut. Throughput numbers are not directly comparable
(pasta has no VM transit), but CRR latency is apples-to-apples because it's
dominated by NAT-table operations on both sides. A proper qemu+passt rig
also exists in `tools/perf-harness/bench-qemu-slirp.sh`.

Usage
Test plan
- `cargo fmt --all -- --check` clean
- `cargo clippy --workspace --all-targets --all-features -- -D warnings` clean
- `cargo test --test network_baseline` — 24/24
- `examples/crr_singleproc_bench` — 100-iter and 500-iter runs clean (host
  accepts N/N)
- `voidbox-network-bench --iterations 3` post-round-2: g2h 11707 Mbps,
  RR p50/p99 = 2/18 µs

Follow-ups (not in this PR)

- Top remaining per-frame allocations are `build_tcp_packet_static` and
  TX-queue frame parsing; eliminating those needs a pool/arena, not a
  scratch hoist.
- `bench-qemu-slirp.sh` is the harness; the perf comparison itself is
  documented separately.